Eliminating redundancy among protein sequences using submodular optimization
نویسندگان
چکیده
Motivation: Submodular optimization, a discrete analogue to continuous convex optimization, has been used with great success in many fields but is not yet widely used in biology. We apply submodular optimization to the problem of removing redundancy in protein sequence data sets. This is a common step in many bioinformatics and structural biology workflows, including creation of non-redundant training sets for sequence and structural models as well as selection of “operational taxonomic units” from metagenomics data. Results: We demonstrate that the submodular optimization approach results in representative protein sequence subsets with greater structural diversity than sets chosen by existing methods. In particular, we compare to a widely used, heuristic algorithm implemented in software tools such as CD-HIT, as well to as a variety of standard clustering methods, using as a gold standard the SCOPe library of protein domain structures. In this setting, submodular optimization consistently yields protein sequence subsets that include more SCOPe domain families than sets of the same size selected by competing approaches. We also show how the optimization framework allows us to design a mixture objective function that performs well for both large and small representative sets. The framework we describe is theoretically optimal under some assumptions, and it is flexible and intuitive because it applies generic methods to optimize one of a variety of objective functions. This application serves as a model for how submodular optimization can be applied to other discrete problems in biology. Availability: Source code is available at https://github.com/mlibbrecht/submodular_sequence_ repset. Contact: [email protected]
منابع مشابه
Some Results about the Contractions and the Pendant Pairs of a Submodular System
Submodularity is an important property of set functions with deep theoretical results and various applications. Submodular systems appear in many applicable area, for example machine learning, economics, computer vision, social science, game theory and combinatorial optimization. Nowadays submodular functions optimization has been attracted by many researchers. Pendant pairs of a symmetric...
متن کاملA Cost-efficient Rewriting Scheme to Improve Restore Performance in Deduplication Systems
In chunk-based deduplication systems, logically consecutive chunks are physically scattered in different containers after deduplication, which results in the serious fragmentation problem. The fragmentation significantly reduces the restore performance due to reading the scattered chunks from different containers. Existing work aims to rewrite the fragmented duplicate chunks into new containers...
متن کاملSummarization Through Submodularity and Dispersion
We propose a new optimization framework for summarization by generalizing the submodular framework of (Lin and Bilmes, 2011). In our framework the summarization desideratum is expressed as a sum of a submodular function and a nonsubmodular function, which we call dispersion; the latter uses inter-sentence dissimilarities in different ways in order to ensure non-redundancy of the summary. We con...
متن کاملDifferentiable Submodular Maximization
We consider learning of submodular functions from data. These functions are important in machine learning and have a wide range of applications, e.g. data summarization, feature selection and active learning. Despite their combinatorial nature, submodular functions can be maximized approximately with strong theoretical guarantees in polynomial time. Typically, learning the submodular function a...
متن کاملThe Construction of Huffman Codes is a Submodular ("Convex") Optimization Problem Over a Lattice of Binary Trees
We show that the space of all binary Huffman codes for a finite alphabet defines a lattice, ordered by the imbalance of the code trees. Representing code trees as path-length sequences, we show that the imbalance ordering is closely related to a majorization ordering on real-valued sequences that correspond to discrete probability density functions. Furthermore, this tree imbalance is a partial...
متن کامل